Microbial pathogen genomes - 
new strategies for identifying 
therapeutics and vaccine targets 

Douglas R. Smith

Advances in high-throughput DNA-sequencing techniques have given us the
unprecedented ability to rapidly determine the nucleotide sequences of entire
bacterial genomes. The application of these methods to the genomes of microbial
pathogens, combined with efficient analytical tools and genome-scale approaches
for studying gene expression, is revolutionizing our approach to the selection of
targets for drug screening and vaccine development. This is bringing new life to this
important, but long-neglected, field of research.



The decision, several years ago, by the US Department of Energy, the National
Institutes of Health (NIH) and several [illegible] organizations to embark upon
programs to map and sequence the human genome has led to a number of important
technological advances that are beginning to have an impact in other areas of
biology. Among these advances are the development of automated methods for the
generation of large amounts of raw DNA-sequencing information, computer
software for rapidly processing and analysing primary sequence data, and
techniques for the rapid assembly of 'shotgun sequencing' reads, even from
entire bacterial genomes. Efficient algorithms for similarity searching allow
the rapid identification of protein-coding sequences that are homologous to
other genes, the sequences of which are held in public and private databases;
as of April 1996, approximately 500 megabases (Mb) of nucleotide sequence were
contained in GenBank, and approximately 200,000 sequences were held in the
SWISS-PROT/Genpept/PIR database of non-redundant protein sequences. Combined
with the wealth of biochemical information that is archived in public
databases, it has become possible to describe rapidly the full repertoire of
genes in a microbial genome, and to predict many of the metabolic pathways that
an organism may utilize.

Progress in this field has been stimulated by the interests of the
biotechnology and pharmaceutical industries in using genome-sequencing data as
a basis for drug discovery. In turn, this has led to the development of
proprietary databases containing genomic information, which provide the basis
for in silico experiments to identify novel targets for drugs, and for
laboratory experiments to identify genes that perform critical functions. This
article summarizes some recent developments in this important area, focusing on
bacterial sequences, and provides examples to illustrate how genome-sequence
information from microbial pathogens can be used to select targets for vaccine
and drug development. The overall process used to proceed from sequence
generation to target validation is illustrated in Fig. 1.

Large-scale sequencing of bacterial genomes

Many laboratories use automated sample-preparation techniques and
fluorescence-based gel readers [such as that produced by Applied Biosystems
Inc. (ABI); Foster City, CA, USA] for the large-scale sequencing of bacterial
genomes. These have the advantage that they are efficient, and relatively easy
to set up and operate. A few laboratories use computer-assisted multiplex
sequencing to achieve the same end (Ref. 1). In multiplex sequencing, samples
consisting of pools of up to 20 plasmids are processed through sample
preparation and gel electrophoresis, and the resulting sequences are determined
from electroblots of the gels by hybridization with radioactive or
fluorescently labeled probes. This technique can be used to generate 40 films
(or digitized images) from each sequencing gel. Although multiplex sequencing
is very efficient at producing large amounts of 'shotgun' data, it is more
difficult to set up and operate in the laboratory than is fluorescence-based
gel sequencing, and it is not suited to directed-finishing strategies. ABI
machines are used in the author's laboratory to generate primer-directed reads
for finishing and gap closure.

During the past year, a group at The Institute for Genomic Research (TIGR;
Gaithersburg, MD, USA) reported the complete sequences of Haemophilus
influenzae (1.8 Mb), a major cause of respiratory infections and meningitis,
especially in children, and of Mycoplasma genitalium (0.6 Mb), which causes
urethritis. Approximately 1.6 Mb of contiguous sequence from the 4.7 Mb
Escherichia coli genome has been published, and the sequencing of a further
2 Mb was reported at the 1995 Genome Sequencing and Analysis (GSA-VII)
meetings. The genome of Helicobacter pylori (1.7 Mb), the major cause of
stomach ulcers, has been sequenced by Genome Therapeutics Corporation (GTC;
Waltham, MA, USA) under a privately funded microbial-pathogen sequencing
program. More than half (1.5 Mb) of the 2.8 Mb genome of Mycobacterium leprae
(the etiologic agent of leprosy) has also been sequenced by GTC, and is
available through GenBank, the GTC web site (http://www.cric.com), and through
MycDB, which contains mycobacterial genome mapping and sequence information.

Other microbial pathogens that are currently being sequenced include Neisseria
gonorrhoeae (University of Oklahoma, Norman, OK, USA), Streptococcus pyogenes
(University of Oklahoma), Treponema pallidum (University of Texas, Houston, TX,
USA, and TIGR), Mycobacterium tuberculosis (GTC and the Sanger Centre, Hinxton,
Cambridge, UK), and Staphylococcus aureus [GTC, and Human Genome Sciences (HGS;
Rockville, MD, USA)].

In addition to these pathogens, the genomes of several archaebacteria and other
non-pathogens are being sequenced. These include Methanococcus jannaschii
(TIGR), Pyrococcus furiosus (University of Utah, Salt Lake City, UT, USA),
Sulfolobus solfataricus (Dalhousie University, Halifax, Nova Scotia, Canada),
and Pyrobaculum aerophilum (California Institute of Technology, Pasadena, CA,
USA, and University of California, Los Angeles, CA, USA). The 1.7 Mb genome of
the archaeon Methanobacterium thermoautotrophicum is near completion at GTC
(Ref. 7). Approximately 2 Mb of the 4.1 Mb Bacillus subtilis genome has now
been sequenced by a consortium of European and Japanese laboratories, and the
project may be completed by the end of 1996 (Ref. 8).
The genome sequence of the cyanobacterium Synechocystis sp. 6803 was recently
published (Ref. 9).

Within the next couple of years, therefore, we can expect an explosion of
bacterial-genome sequence information from species representing a variety of
phylogenetic lineages, including many pathogens.

Pharmaceutical companies have shown considerable interest in using pathogen
genomics to facilitate the development of vaccines and small-molecule
therapeutics. For example, researchers at Glaxo Wellcome have sequenced a
substantial fraction of the H. pylori genome to assist in the process of drug
discovery. Over the past year, GTC has formed two research alliances with
pharmaceutical companies to take advantage of sequences from microbial
pathogens: one with [text missing] pharmaceuticals and vaccines to treat
H. pylori infection, and one with Schering-Plough (Union, NJ, USA), to develop
broad-spectrum antibiotics and vaccines. Although the genomic route to drug
discovery for bacterial pathogens is new and remains unproved, the basic
paradigm (outlined below) of gene identification, [text missing] will become
involved, and that in the [text missing] additional research alliances between
genomics companies and the pharmaceutical industry will materialize in this
area.

Figure 1
Flow diagram illustrating the process by which a microbial genome sequence is
[text illegible] in target selection for therapeutics development. The
individual steps are outlined throughout the text. In the case of vaccine
candidates, gene products from selected targets are expressed and tested in
animal models.

From sequence to genes

The first task, when confronted with an entire bacterial-genome sequence, is to
identify all the genes. This can be accomplished using a variety of techniques,
but the most successful approaches use a combination of reading-frame and
codon-usage analysis, together with similarity searching, to identify putative
genes with homology to previously described sequences. Commonly used tools
include GeneMark (Ref. 10), GenomeBrowser (Ref. 11), BLAST (Ref. 12), and
highly parallelized implementations of the Smith-Waterman algorithm, such as
BLAZE or MPsrch (Ref. 13). In general, organism-specific codon usage is highly
predictive for bacterial genes, but its effective use depends on the existence
of sufficient information to generate accurate codon-usage matrices. In some
cases, subsets of genes within an organism will exhibit codon-usage patterns
that deviate significantly from the norm (Ref. 14). Such genes are thought to
represent evolutionarily recent acquisitions by phage transduction,
conjugation, or some other form of horizontal transfer from other organisms. If
enough of these genes are present, codon-usage tables of genomic subsets can be
constructed to identify them. Translational start sites can be identified by
the occurrence of start codons that coincide with abrupt changes in codon
usage, the initiation of homology to previously characterized genes, or the
presence of Shine-Dalgarno sequences (Ref. 15). Automated analysis tools (such
as GenomeBrowser; Ref. 11) that provide a graphical display of open reading
frames (ORFs), codon usage, database homologies and other features make the
task of identifying bacterial genes, and their relationships with each other in
the genome, relatively straightforward. With the increasing pace of
bacterial-genome sequencing, there is an emerging need for second-generation
tools that will automate most of the laborious annotation process.
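
The codon-usage analysis described above can be illustrated with a minimal
sketch. This is not the method used by GeneMark or GenomeBrowser; it simply
builds a codon-frequency table from a set of known genes and scores a candidate
ORF by its mean log-odds against a uniform background. The function names, the
pseudocount and the uniform (1/64) background are assumptions made for
illustration only.

```python
from collections import Counter
from math import log

BASES = "ACGT"
ALL_CODONS = [a + b + c for a in BASES for b in BASES for c in BASES]

def codons(seq):
    """Split a DNA string into complete in-frame codons (frame 0)."""
    seq = seq.upper()
    usable = len(seq) - len(seq) % 3
    return [seq[i:i + 3] for i in range(0, usable, 3)]

def codon_usage(genes, pseudocount=1.0):
    """Estimate relative codon frequencies from a set of known genes.

    The pseudocount (an illustrative choice) keeps unseen codons from
    receiving a frequency of zero.
    """
    counts = Counter(c for gene in genes for c in codons(gene))
    total = sum(counts[c] for c in ALL_CODONS) + pseudocount * len(ALL_CODONS)
    return {c: (counts[c] + pseudocount) / total for c in ALL_CODONS}

def usage_score(orf, usage):
    """Mean log-odds per codon versus a uniform (1/64) background.

    Positive values mean the ORF uses codons favoured in the training set;
    negative values flag atypical usage.
    """
    observed = [c for c in codons(orf) if c in usage]
    if not observed:
        return 0.0
    return sum(log(usage[c] * 64) for c in observed) / len(observed)

# Toy training set and two candidate ORFs (all sequences are made up).
training = ["ATGGCTGCTGCTAAA" * 3]
usage = codon_usage(training)
typical_score = usage_score("GCTGCTGCTAAA", usage)
atypical_score = usage_score("CCCGGGCCCGGG", usage)
```

In this toy case the ORF built from the training codons scores above zero and
the ORF built from unseen codons scores below zero. A real analysis would need
a large, trusted training set, and, as noted above, separate tables for subsets
of genes whose usage deviates from the norm.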

From genes to function

The second phase in the analysis of bacterial genomes is to identify the
function of as many genes as possible. Currently, sequence homology is the most
powerful tool. A high degree of homology between the putative translation
product of a newly identified gene and an enzyme whose function has been
thoroughly studied in other organisms provides strong support for the function
of that protein, especially if it is the only homolog in the genome. Other
useful tools include programs that identify sequence motifs from databases such
as PROSITE (Ref. 16), BLOCKS (Ref. 17), BEAUTY (Ref. 18) and ProDom (Ref. 19).
If one is attempting to identify vaccine candidates, then examining whether the
protein contains a secretion signal [text illegible]. Although the tools
described here are very good at identifying homologs, 25-40% of the genes in a
bacterial genome typically fail to show significant similarity with known
proteins.
Once the set of similarity-searching tools has been exhausted, one must return
to microbiology to further elucidate the function and expression pattern of
predicted genes. Commonly used approaches to identifying essential genes in an
organism include the use of gene knockout disruptions using
transposon-mediated mutagenesis, or homologous recombination with disrupted
gene constructs that contain an antibiotic-resistance cassette. Gene
disruptions can be generated in a variety of ways, including sophisticated
'hit-and-run' approaches that interrupt a gene without introducing polar
effects into downstream ORFs (Ref. 20). However, a gene-by-gene approach to the
study of a whole genome is certainly time consuming and labor intensive.

The availability of large amounts of genome-sequence information has
stimulated the development of new approaches to functional analysis on a
genomic scale. This has been particularly true for researchers investigating
yeast, where a concerted effort is being made to ascertain the function of
every ORF in the genome. Such strategies include the conceptually simple, but
technologically advanced, technique of [text missing] polymerase chain reaction
(PCR)-amplified gene sequences on glass slides to allow the fluorescence-based
detection of quantitative signals from labeled cDNA probes on large numbers of
genes simultaneously, perhaps even all the genes of an organism. An ingenious
PCR-based approach to efficient sequence-based expression analysis has been
demonstrated. For example, a technique termed 'genetic footprinting' promises
to replace individual gene knockouts by a global transposon-mutagenesis
approach. Insertions are induced en masse in a strain of interest [text
illegible] under a variety of conditions. [Text illegible] transposon hops
[text illegible] under-represented [text illegible] the gene is required for
growth. A conceptually similar dropout technique, which uses tagged [text
missing]. Techniques that probe subsets of genes for specific [text missing]
growth in the host have also been described. These techniques provide clones
from which a given [text missing]. Finally, protein sequencing and
mass-spectrometry-based peptide analysis have been used to identify protein
components (e.g. outer-membrane proteins) in partially purified fractions, or
to identify specific proteins [text illegible] two-dimensional gel
electrophoresis. Sequences generated [text missing] the gene sequences [text
missing] expressed.



Target selection and validation

The techniques described in the previous section can be used to identify genes
in specific functional categories that may represent good targets for drug or
vaccine development. In general, when developing new antibiotics, one is
interested in genes that are essential under all growth conditions (and
preferably even in quiescent cells), and for which inhibitors with useful
chemical properties, such as permeability and low toxicity, can be identified.
One advantage of having the entire sequence of a genome is that targets can be
prioritized in terms of their activities and the properties of compounds that
are known to interact with them. Even in the absence of knockout or in vivo
expression experiments, additional biological information can aid in narrowing
down the field of choices. For example, genes can be selected on the basis of
their probable roles in intracellular metabolism. [Text illegible] EcoCyc
(Ref. 25) or PUMA [text illegible] describe known metabolic pathways. [Text
illegible] homologs of identified genes (determined using the Protein Data
Bank) can be used to assist in the molecular modeling of inhibitors (some
resources for molecular modeling can be found at [text missing]).

As more genomes are sequenced, it will become possible to identify genes that
are unique to a particular organism or set of organisms, [text missing] that
are conserved in certain groups. Thus, for example, it will be possible to use
electronic comparison to identify genes that are present in H. pylori but not
in other gut-dwelling bacteria such as E. coli, providing a basis for the
development of antibiotics specific to H. pylori. Although combinatorial
chemistries promise to speed up our ability to synthesize [text missing], the
genome-based approach described here provides an avenue for the rational
identification and selection of key targets for therapeutics development.
Ultimate validation of the targets will, of course, require additional
experiments such as protein expression, biochemical-assay development and
animal studies to identify those with the most [text illegible] properties or
inhibitors.
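
The electronic comparison described above amounts to a set difference over gene
lists. The sketch below assumes that a similarity search (for example with
BLAST) has already been run and reduced to a set of pathogen gene identifiers
that have a significant hit in the comparison genome; the function name, the
gene identifiers and the optional essential-gene filter are hypothetical.

```python
def select_candidate_targets(pathogen_genes, genes_with_hit, essential=None):
    """Return pathogen genes with no significant homolog in the other genome.

    pathogen_genes : iterable of gene identifiers from the pathogen
    genes_with_hit : set of pathogen gene identifiers that showed a
                     significant similarity hit in the comparison genome
    essential      : optional set of genes found essential (e.g. by knockout
                     experiments); if given, only these are kept
    """
    candidates = [g for g in pathogen_genes if g not in genes_with_hit]
    if essential is not None:
        candidates = [g for g in candidates if g in essential]
    return candidates

# Made-up identifiers for illustration only.
pathogen = ["geneA", "geneB", "geneC", "geneD"]
shared_with_comparison = {"geneA", "geneC"}
essential_genes = {"geneA", "geneD"}

organism_specific = select_candidate_targets(pathogen, shared_with_comparison)
prioritized = select_candidate_targets(
    pathogen, shared_with_comparison, essential=essential_genes
)
```

Here `organism_specific` holds the genes absent from the comparison genome, and
`prioritized` narrows that list to genes that are also essential. What counts
as a "significant" hit is left to the upstream similarity search and is the
main judgment call in such a comparison.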

Acknowledgements

The sequencing of Mycobacterium leprae and [text illegible] development of
[text missing] of the National Center for Human Genome [text missing]. [Text
illegible] thermoautotrophicum is supported under the Microbial Genome Program,
Grant No. DE-FC02-95ER61967, [text illegible] Office of Health and
Environmental Research of the US Department of Energy. The sequencing of [text
missing] by Genome Therapeutics Corporation. Thanks to [name illegible] for
comments on the manuscript.